Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models

Abstract

Alignment between image and text has shown promising improvements on patch-level pre-trained document image models. However, investigating more effective or finer-grained alignment techniques during pre-training requires a large amount of computation and time. Thus, a question naturally arises: could we fine-tune the pre-trained models adaptively to downstream tasks with alignment objectives and achieve comparable or better performance? In this paper, we propose a new model architecture with alignment-enriched tuning (dubbed AETNet) upon pre-trained document image models, which adapts to downstream tasks with a joint task-specific supervised and alignment-aware contrastive objective. Specifically, we introduce an extra visual transformer as the alignment-aware encoder before multimodal fusion. We consider alignment in the following three aspects: 1) document-level alignment by leveraging the cross-modal and intra-modal contrastive loss; 2) global-local alignment for modeling localized and structural information in document images; and 3) local-level alignment for more accurate patch-level information. Experiments on various downstream tasks show that AETNet can achieve state-of-the-art performance. Notably, AETNet consistently outperforms state-of-the-art pre-trained models, such as LayoutLMv3 with fine-tuning techniques, on different downstream tasks. Code is available at https://github.com/MAEHCM/AET.
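The abstract does not state the exact form of the document-level contrastive loss. As a rough illustration only, the sketch below uses a symmetric InfoNCE-style loss, a common choice for cross-modal alignment; it is an assumption here, not a confirmed description of AETNet. The function names, the temperature of 0.07, and the toy embeddings are all hypothetical, and the code is written in plain Python so that it runs without any deep learning framework.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def info_nce(queries, keys, temperature=0.07):
    # Mean cross-entropy in which queries[i] must pick keys[i] out of all keys.
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the logits.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)

def cross_modal_loss(image_embs, text_embs, temperature=0.07):
    # Average of the image-to-text and text-to-image directions.
    return 0.5 * (info_nce(image_embs, text_embs, temperature)
                  + info_nce(text_embs, image_embs, temperature))

# Toy document-level embeddings (illustrative values, not real model outputs).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]
text_shuffled = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = cross_modal_loss(image_embs, text_aligned)
shuffled_loss = cross_modal_loss(image_embs, text_shuffled)
print(aligned_loss < shuffled_loss)
```

Under the same assumption, an intra-modal term could reuse `info_nce` on two views of the same modality, and the joint objective mentioned in the abstract would add such terms to the task-specific supervised loss. How the terms are weighted is not given in the abstract.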

Similar resources

ImageNet pre-trained models with batch normalization

Convolutional neural networks (CNN) pre-trained on ImageNet are the backbone of most state-of-the-art approaches. In this paper, we present a new set of pretrained models with popular state-of-the-art architectures for the Caffe framework. The first release includes Residual Networks (ResNets) with generation script as well as the batch-normalization-variants of AlexNet and VGG19. All models ou...

Multi-Level Structured Models for Document-Level Sentiment Classification

In this paper, we investigate structured models for document-level sentiment classification. When predicting the sentiment of a subjective document (e.g., as positive or negative), it is well known that not all sentences are equally discriminative or informative. But identifying the useful sentences automatically is itself a difficult learning problem. This paper proposes a joint two-level appr...

Learning Surrogate Models of Document Image Quality Metrics for Automated Document Image Processing

Computation of document image quality metrics often depends upon the availability of a ground truth image corresponding to the document. This limits the applicability of quality metrics in applications such as hyperparameter optimization of image processing algorithms that operate on-the-fly on unseen documents. This work proposes the use of surrogate models to learn the behavior of a given doc...

Image alignment via kernelized feature learning

Machine learning is an application of artificial intelligence that is able to automatically learn and improve from experience without being explicitly programmed. The primary assumption for most of the machine learning algorithms is that the training set (source domain) and the test set (target domain) follow from the same probability distribution. However, in most of the real-world application...

Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment

The quality of a statistical machine translation (SMT) system is heavily dependent upon the amount of parallel sentences used in training. In recent years, there have been several approaches developed for obtaining parallel sentences from non-parallel, or comparable data, such as news articles published within the same time period (Munteanu and Marcu, 2005), or web pages with a similar structur...

Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2023

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v37i2.25357